Deep Music Genre Classification

Natural Language Processing
Neural Networks
Deep Learning
Deep Music Genre Classification in Python
Author

Lukka Wolff

Published

May 12, 2025

Abstract

In this blog post, we explore a deep learning approach to predicting the genres of music tracks. We leverage both song lyrics and engineered metadata features. We tokenize the lyrics with a BERT tokenizer and make use of Spotify’s engineered audio–semantic features (e.g., acousticness, danceability, thematic tags). We implement three neural networks: a lyric-based model, a metadata-only network, and a combined network that uses both lyric embeddings and engineered features. We compare how our different models stack up against one another and our base rate to assess the success of our different approaches to genre prediction.

Data

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

from torchinfo import summary

import pandas as pd
import numpy as np
import time

# for train-test split
from sklearn.model_selection import train_test_split

# for suppressing bugged warnings from torchinfo
import warnings
warnings.filterwarnings("ignore", category = UserWarning)

# tokenizers from HuggingFace
from transformers import BertTokenizer

# for building condensed vocab sets
# from torchtext.vocab import build_vocab_from_iterator

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

We are loading in a Kaggle dataset that contains information about music made between the years 1950 and 2019, collected through Spotify. The dataset contains lyrics, artist info, track names, etc. Importantly, it also includes music metadata like sadness, danceability, loudness, acousticness, etc.

url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/tcc_ceds_music.csv"
df = pd.read_csv(url)

Let’s have a look at some of the raw data!

df.head()
Unnamed: 0 artist_name track_name release_date genre lyrics len dating violence world/life ... sadness feelings danceability loudness acousticness instrumentalness valence energy topic age
0 0 mukesh mohabbat bhi jhoothi 1950 pop hold time feel break feel untrue convince spea... 95 0.000598 0.063746 0.000598 ... 0.380299 0.117175 0.357739 0.454119 0.997992 0.901822 0.339448 0.137110 sadness 1.0
1 4 frankie laine i believe 1950 pop believe drop rain fall grow believe darkest ni... 51 0.035537 0.096777 0.443435 ... 0.001284 0.001284 0.331745 0.647540 0.954819 0.000002 0.325021 0.263240 world/life 1.0
2 6 johnnie ray cry 1950 pop sweetheart send letter goodbye secret feel bet... 24 0.002770 0.002770 0.002770 ... 0.002770 0.225422 0.456298 0.585288 0.840361 0.000000 0.351814 0.139112 music 1.0
3 10 pérez prado patricia 1950 pop kiss lips want stroll charm mambo chacha merin... 54 0.048249 0.001548 0.001548 ... 0.225889 0.001548 0.686992 0.744404 0.083935 0.199393 0.775350 0.743736 romantic 1.0
4 12 giorgos papadopoulos apopse eida oneiro 1950 pop till darling till matter know till dream live ... 48 0.001350 0.001350 0.417772 ... 0.068800 0.001350 0.291671 0.646489 0.975904 0.000246 0.597073 0.394375 romantic 1.0

5 rows × 31 columns

Here is a brief look at how many songs we have in each represented genre.

df.groupby("genre").size()
genre
blues      4604
country    5445
hip hop     904
jazz       3845
pop        7042
reggae     2498
rock       4034
dtype: int64

This is a pretty large number of songs to classify, and it includes some genres I personally don’t care for. So, to make the dataframe more manageable and more relevant to me, we are going to narrow it down to four genres: reggae, hip hop, rock, and jazz.

genres = {
    "hip hop"   : 0,
    "jazz" : 1,
    "reggae" : 2,
    "rock" : 3,
}

# keep only the four genres of interest; .copy() avoids a SettingWithCopyWarning when we relabel below
df = df[df["genre"].isin(list(genres))].copy()
df.head()
Unnamed: 0 artist_name track_name release_date genre lyrics len dating violence world/life ... sadness feelings danceability loudness acousticness instrumentalness valence energy topic age
17091 54304 gene ammons it's the talk of the town 1950 jazz lovers sweethearts hard understand know happen... 61 0.001096 0.001096 0.001096 ... 0.319570 0.001096 0.352323 0.620388 0.868474 0.235830 0.430132 0.282260 sadness 1.0
17092 54305 gene ammons you go to my head 1950 jazz head linger like haunt refrain spin round brai... 48 0.001754 0.340964 0.001754 ... 0.001754 0.001754 0.379400 0.638541 0.907630 0.900810 0.221970 0.184159 violence 1.0
17093 54307 bud powell yesterdays 1950 jazz music speak start hear musicians like dizzy gi... 107 0.001144 0.001144 0.074762 ... 0.001144 0.097082 0.489873 0.467400 0.992972 0.927126 0.334295 0.228204 music 1.0
17094 54311 tony bennett stranger in paradise 1950 jazz hand stranger paradise lose wonderland strange... 41 0.002105 0.180524 0.002105 ... 0.527429 0.002105 0.179032 0.559470 0.983936 0.001781 0.086974 0.235211 sadness 1.0
17095 54313 dean martin zing-a zing-a zing boom 1950 jazz zinga zinga zinga zinga zinga zinga zinga zing... 160 0.001253 0.001253 0.001253 ... 0.425721 0.001253 0.580851 0.687409 0.655622 0.000000 0.936109 0.418400 sadness 1.0

5 rows × 31 columns

df["genre"] = df["genre"].apply(genres.get)
df
Unnamed: 0 artist_name track_name release_date genre lyrics len dating violence world/life ... sadness feelings danceability loudness acousticness instrumentalness valence energy topic age
17091 54304 gene ammons it's the talk of the town 1950 1 lovers sweethearts hard understand know happen... 61 0.001096 0.001096 0.001096 ... 0.319570 0.001096 0.352323 0.620388 0.868474 0.235830 0.430132 0.282260 sadness 1.000000
17092 54305 gene ammons you go to my head 1950 1 head linger like haunt refrain spin round brai... 48 0.001754 0.340964 0.001754 ... 0.001754 0.001754 0.379400 0.638541 0.907630 0.900810 0.221970 0.184159 violence 1.000000
17093 54307 bud powell yesterdays 1950 1 music speak start hear musicians like dizzy gi... 107 0.001144 0.001144 0.074762 ... 0.001144 0.097082 0.489873 0.467400 0.992972 0.927126 0.334295 0.228204 music 1.000000
17094 54311 tony bennett stranger in paradise 1950 1 hand stranger paradise lose wonderland strange... 41 0.002105 0.180524 0.002105 ... 0.527429 0.002105 0.179032 0.559470 0.983936 0.001781 0.086974 0.235211 sadness 1.000000
17095 54313 dean martin zing-a zing-a zing boom 1950 1 zinga zinga zinga zinga zinga zinga zinga zing... 160 0.001253 0.001253 0.001253 ... 0.425721 0.001253 0.580851 0.687409 0.655622 0.000000 0.936109 0.418400 sadness 1.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
28367 82447 mack 10 10 million ways 2019 0 cause fuck leave scar tick tock clock come kno... 78 0.001350 0.001350 0.001350 ... 0.065664 0.001350 0.889527 0.759711 0.062549 0.000000 0.751649 0.695686 obscene 0.014286
28368 82448 m.o.p. ante up (robbin hoodz theory) 2019 0 minks things chain ring braclets yap fame come... 67 0.001284 0.001284 0.035338 ... 0.001284 0.001284 0.662082 0.789580 0.004607 0.000002 0.922712 0.797791 obscene 0.014286
28369 82449 nine whutcha want? 2019 0 get ban get ban stick crack relax plan attack ... 77 0.001504 0.154302 0.168988 ... 0.001504 0.001504 0.663165 0.726970 0.104417 0.000001 0.838211 0.767761 obscene 0.014286
28370 82450 will smith switch 2019 0 check check yeah yeah hear thing call switch g... 67 0.001196 0.001196 0.001196 ... 0.001196 0.001196 0.883028 0.786888 0.007027 0.000503 0.508450 0.885882 obscene 0.014286
28371 82451 jeezy r.i.p. 2019 0 remix killer alive remix thriller trap bitch s... 83 0.001012 0.075202 0.001012 ... 0.001012 0.033995 0.828875 0.674794 0.015862 0.000000 0.475474 0.492477 obscene 0.014286

11281 rows × 31 columns

The base rate on our classification is the proportion of the data set occupied by the largest label class:

df.groupby("genre").size() / len(df)
genre
0    0.080135
1    0.340839
2    0.221434
3    0.357592
dtype: float64

If we always guessed category 3 (rock), then we would expect an accuracy of roughly 36%. So, our task will be to see whether we can train a model to beat this.
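
As a quick sanity check, the base rate can be recomputed from the raw genre counts printed earlier. This is a minimal sketch in plain Python; the helper name `base_rate` is made up for illustration and is not part of the notebook.

```python
from collections import Counter

def base_rate(labels):
    """Return (most common label, its share of the data)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Genre counts copied from the groupby output above (four kept genres only).
genre_counts = {"hip hop": 904, "jazz": 3845, "reggae": 2498, "rock": 4034}
labels = [genre for genre, n in genre_counts.items() for _ in range(n)]

majority, rate = base_rate(labels)
print(majority, round(rate, 4))  # rock 0.3576
```

Any model we train should be judged against this number rather than against 25%, since the four classes are far from balanced.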

As we try to predict the genre of the track, we will use lyrics alongside some other engineered features (metadata) that we define below.

engineered_features = ['dating', 'violence', 'world/life', 'night/time','shake the audience','family/gospel', 'romantic', 'communication','obscene', 'music', 'movement/places', 'light/visual perceptions','family/spiritual', 'like/girls', 'sadness', 'feelings', 'danceability','loudness', 'acousticness', 'instrumentalness', 'valence', 'energy']      

Our models only need these engineered features, the lyrics, and our target value (genre), so we can put them all into the same dataframe and use slicing to access the different parts later.

df_clean= df[engineered_features + ['lyrics', 'genre']].copy()
df_clean.head()
dating violence world/life night/time shake the audience family/gospel romantic communication obscene music ... sadness feelings danceability loudness acousticness instrumentalness valence energy lyrics genre
17091 0.001096 0.001096 0.001096 0.001096 0.036316 0.001096 0.001096 0.460773 0.086498 0.001096 ... 0.319570 0.001096 0.352323 0.620388 0.868474 0.235830 0.430132 0.282260 lovers sweethearts hard understand know happen... 1
17092 0.001754 0.340964 0.001754 0.001754 0.001754 0.001754 0.131872 0.001754 0.001754 0.001754 ... 0.001754 0.001754 0.379400 0.638541 0.907630 0.900810 0.221970 0.184159 head linger like haunt refrain spin round brai... 1
17093 0.001144 0.001144 0.074762 0.046173 0.001144 0.018789 0.001144 0.001655 0.001144 0.421734 ... 0.001144 0.097082 0.489873 0.467400 0.992972 0.927126 0.334295 0.228204 music speak start hear musicians like dizzy gi... 1
17094 0.002105 0.180524 0.002105 0.002105 0.002105 0.002105 0.002105 0.201965 0.002105 0.002105 ... 0.527429 0.002105 0.179032 0.559470 0.983936 0.001781 0.086974 0.235211 hand stranger paradise lose wonderland strange... 1
17095 0.001253 0.001253 0.001253 0.001253 0.001253 0.081126 0.001253 0.111951 0.001253 0.268737 ... 0.425721 0.001253 0.580851 0.687409 0.655622 0.000000 0.936109 0.418400 zinga zinga zinga zinga zinga zinga zinga zing... 1

5 rows × 24 columns

Finally, we perform a train-validation split so that we can later evaluate our models on held-out data.

df_train, df_val = train_test_split(df_clean,shuffle = True, test_size = 0.2)
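
The split above is purely random, so a small class like hip hop (about 8% of the data) can end up slightly over- or under-represented in the validation set. One possible alternative, not what this post uses, is a stratified split. Below is a plain-Python sketch of the idea; `stratified_split` is a hypothetical helper, and the label list is rebuilt from the genre counts shown earlier.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Split indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                       # shuffle within each class
        n_val = round(len(idxs) * test_size)    # per-class validation quota
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return train_idx, val_idx

# 0: hip hop, 1: jazz, 2: reggae, 3: rock (counts from the groupby output)
labels = [0] * 904 + [1] * 3845 + [2] * 2498 + [3] * 4034
train_idx, val_idx = stratified_split(labels)

val_hip_hop = sum(1 for i in val_idx if labels[i] == 0)
print(len(train_idx), len(val_idx), val_hip_hop)
```

With scikit-learn, the same effect should be available through the `stratify` argument of `train_test_split`, for example `stratify=df_clean["genre"]`, optionally with a fixed `random_state` for reproducibility.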

Text Vectorization

We now need to vectorize the lyrics. We’re going to use tokenization to break up the lyrics into a sequence of tokens, and then vectorize that sequence.

We will be using a tokenizer imported from HuggingFace.

tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")

For our purposes it’s more convenient to assign an integer to each token, which we can do like this:

encoded = tokenizer("I love reggae music!")
encoded
{'input_ids': [101, 1045, 2293, 15662, 2189, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

To do the reverse, we can use the .decode method of the tokenizer:

tokenizer.decode(encoded["input_ids"])
'[CLS] i love reggae music! [SEP]'

Here is some code to help us prepare our dataset with encodings. A lot of our lyrics are different lengths so we will pad the shorter ones with 0s and truncate others that are especially long. We will make use of the torch Dataset class to help manage our data.

max_len = 512 # BERT capacity

def preprocess(df, tokenizer, max_len):
    lyrics_tokens = tokenizer(list(df["lyrics"]), padding="max_length", truncation=True, max_length=max_len)["input_ids"]
    engineered = df[engineered_features].values.tolist()
    y = list(df["genre"])
    return lyrics_tokens, engineered, y

class TextDataFromDF(Dataset):
    def __init__(self, df):
        self.lyrics_tokens, self.engineered_feats, self.y = preprocess(df, tokenizer, max_len)

    def __getitem__(self, ix):
        return self.lyrics_tokens[ix], self.engineered_feats[ix], self.y[ix]

    def __len__(self):
        return len(self.y)
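
To make the padding and truncation step concrete, here is a small plain-Python sketch that mimics what `padding="max_length"` and `truncation=True` do to a list of token ids. It does not call the tokenizer; `pad_or_truncate` is a made-up helper, and it assumes a padding id of 0 and that the closing `[SEP]` token should survive truncation.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Force a token-id list to exactly max_len entries."""
    if len(ids) > max_len:
        # keep the final token (e.g. [SEP]) so the sequence still ends properly
        return ids[: max_len - 1] + [ids[-1]]
    # right-pad short sequences with the padding id
    return ids + [pad_id] * (max_len - len(ids))

# token ids for "I love reggae music!" from the example above
encoded = [101, 1045, 2293, 15662, 2189, 999, 102]

print(pad_or_truncate(encoded, 10))  # [101, 1045, 2293, 15662, 2189, 999, 102, 0, 0, 0]
print(pad_or_truncate(encoded, 5))   # [101, 1045, 2293, 15662, 102]
```

Every song therefore ends up as exactly `max_len` integers, which is what lets us stack them into a single tensor per batch.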

Let’s make our encoded datasets!

train_data = TextDataFromDF(df_train)
val_data   = TextDataFromDF(df_val)

Here is what a single song’s information looks like now:

X_tokens, X_feats, y = train_data[1]
print(X_tokens, X_feats)
print(y)
[101, 2372, 2113, 21209, 6887, 16585, 2477, 2111, 8501, 3613, 9266, 2213, 9680, 2444, 9152, 23033, 2015, 10675, 2015, 4401, 2991, 4533, 4952, 11898, 10432, 12170, 9102, 6510, 8081, 4485, 2729, 10667, 14033, 6510, 2131, 2477, 2175, 4485, 2131, 2518, 3861, 2272, 2420, 2208, 16371, 4246, 7047, 8046, 4485, 2215, 4485, 4248, 4355, 7281, 7579, 6841, 16360, 8091, 4485, 4982, 4503, 14255, 23344, 2227, 2131, 3947, 3238, 2444, 11565, 10020, 2102, 3305, 2514, 2665, 3259, 2192, 2518, 2903, 2066, 8554, 10421, 7200, 5223, 2342, 2757, 11274, 4372, 14540, 10696, 2111, 2219, 3828, 2111, 13660, 3240, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] [0.0011961722985867, 0.1300521675514939, 0.1951359532512352, 0.0011961722561089, 0.0011961723204168, 0.0011961722655141, 0.0011961722799619, 
0.1710064773999869, 0.3944457875450848, 0.0011961722772403, 0.0011961723013686, 0.0011961723181082, 0.0522687272336512, 0.0011961722965877, 0.0011961723815529, 0.0415406469256242, 0.4476334885735947, 0.7206368740866087, 0.0736938490902099, 0.0, 0.6589035449299256, 0.6446335461127515]
2

We are going to feed data to the model in batches, so we need a DataLoader, which in turn needs a collate function to ensure that we are inputting tensors of the right size.

def collate(data):
    tokens = torch.tensor([d[0] for d in data], dtype=torch.long)
    engineered = torch.tensor([d[1] for d in data], dtype=torch.float)
    y = torch.tensor([d[2] for d in data], dtype=torch.long)
    return (tokens, engineered), y

train_loader = DataLoader(train_data, batch_size=8, shuffle=True, collate_fn = collate)
val_loader = DataLoader(val_data, batch_size=8, shuffle=True, collate_fn = collate)
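
The core of the collate step is regrouping a list of per-song tuples into per-field batches. The sketch below shows just that regrouping in plain Python, without the tensor conversion; `regroup` and the toy samples are made up for illustration.

```python
def regroup(samples):
    """Turn [(tokens, feats, y), ...] into ((tokens_batch, feats_batch), y_batch)."""
    tokens, feats, labels = (list(field) for field in zip(*samples))
    return (tokens, feats), labels

# two toy "songs": 4 token ids, 2 engineered features, 1 label each
samples = [
    ([101, 2051, 102, 0], [0.1, 0.9], 3),
    ([101, 2668, 102, 0], [0.4, 0.2], 2),
]

(tokens, feats), labels = regroup(samples)
print(len(tokens), len(tokens[0]))  # 2 4
print(labels)                       # [3, 2]
```

The real `collate` does the same thing and then wraps each field in a tensor with the dtype the model expects: long for token ids and labels, float for the engineered features.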

Here is what a batch of data looks like. The predictor data X is a tuple of two tensors: the first holds token indices, padded with 0s, and the second holds the values of our engineered features. Because X is a tuple, X[:2] returns both tensors in full (all 8 songs in the batch) rather than the first two rows, while y[:2] shows the first two labels:

X, y = next(iter(train_loader))
X[:2]
(tensor([[  101,  2621,  4553,  ...,     0,     0,     0],
         [  101,  2668, 14740,  ...,     0,     0,     0],
         [  101,  2305,  2272,  ...,     0,     0,     0],
         ...,
         [  101,  2051,  2621,  ...,     0,     0,     0],
         [  101,  2051,  2202,  ...,     0,     0,     0],
         [  101,  5949,  2773,  ...,     0,     0,     0]]),
 tensor([[2.5063e-03, 2.5063e-03, 3.3226e-01, 9.8139e-02, 2.5063e-03, 2.5063e-03,
          2.5063e-03, 2.5063e-03, 2.5063e-03, 1.2809e-01, 2.5063e-03, 2.5063e-03,
          2.5063e-03, 2.8335e-01, 2.5063e-03, 2.5063e-03, 5.5594e-01, 7.7276e-01,
          2.0883e-02, 1.1235e-05, 2.8174e-01, 4.7846e-01],
         [1.8797e-03, 5.0473e-01, 1.8797e-03, 1.8797e-03, 3.7594e-02, 3.9971e-02,
          1.8797e-03, 1.8797e-03, 1.8797e-03, 1.8797e-03, 1.8797e-03, 1.8797e-03,
          1.3261e-01, 1.8797e-03, 2.2307e-01, 1.8797e-03, 8.1155e-01, 7.5999e-01,
          8.7248e-02, 2.5304e-03, 5.6513e-01, 6.5364e-01],
         [1.9493e-03, 1.9493e-03, 3.4622e-01, 1.9493e-03, 1.9493e-03, 1.9493e-03,
          1.9493e-03, 1.9493e-03, 1.9493e-03, 1.9493e-03, 2.8898e-01, 3.3361e-01,
          1.9493e-03, 1.9493e-03, 1.9493e-03, 1.9493e-03, 6.1118e-01, 5.5086e-01,
          9.5181e-01, 1.8725e-01, 3.4151e-01, 2.2320e-01],
         [5.7208e-04, 5.7208e-04, 5.1883e-02, 5.7208e-04, 6.8732e-02, 2.8145e-02,
          5.7208e-04, 2.5657e-01, 3.6477e-01, 5.7208e-04, 2.2247e-01, 5.7208e-04,
          5.7208e-04, 5.7208e-04, 5.7208e-04, 5.7208e-04, 7.4764e-01, 6.9715e-01,
          1.6867e-01, 0.0000e+00, 8.4233e-01, 4.8046e-01],
         [1.9611e-02, 3.9300e-01, 2.4161e-01, 8.9206e-04, 1.8294e-02, 7.0961e-02,
          8.9206e-04, 8.9206e-04, 8.9206e-04, 8.9206e-04, 8.9206e-04, 1.4598e-01,
          4.5708e-02, 5.5025e-02, 8.9206e-04, 8.9206e-04, 8.0180e-01, 6.7451e-01,
          5.6827e-05, 9.4838e-01, 9.0210e-01, 7.9379e-01],
         [1.0526e-03, 1.0526e-03, 1.0526e-03, 3.2379e-01, 1.0526e-03, 1.0526e-03,
          9.0729e-02, 1.7966e-01, 1.0526e-03, 1.0713e-01, 1.0526e-03, 2.5740e-01,
          1.0526e-03, 1.0526e-03, 1.0526e-03, 2.7609e-02, 2.0286e-01, 6.0398e-01,
          9.4478e-01, 5.6883e-05, 5.1731e-02, 2.8426e-01],
         [1.5038e-03, 1.8747e-01, 1.5038e-03, 1.6175e-01, 1.5038e-03, 1.5038e-03,
          1.5038e-03, 1.5038e-03, 4.1537e-02, 1.5038e-03, 1.5038e-03, 1.5038e-03,
          1.5038e-03, 1.5038e-03, 5.2697e-01, 1.5038e-03, 5.4403e-01, 7.8607e-01,
          6.8374e-06, 8.3806e-01, 3.3739e-01, 9.4795e-01],
         [2.9240e-03, 2.9240e-03, 2.9240e-03, 1.0013e-01, 2.9240e-03, 2.9240e-03,
          2.9240e-03, 2.9240e-03, 2.9240e-03, 1.7470e-01, 2.9240e-03, 2.9240e-03,
          2.9240e-03, 9.0332e-02, 4.7758e-01, 2.9240e-03, 6.3609e-01, 7.2769e-01,
          5.7932e-01, 3.3603e-01, 6.4963e-01, 7.6075e-01]]))
y[:2]
tensor([3, 2])

Model Building

We are going to train three neural networks to classify our genres.

  • Using Lyrics to Classify
  • Using Engineered Features (Metadata) to Classify
  • Using Lyrics and Metadata to Classify

Let’s build a model for classifying genres based on lyrics first.

Lyrical Classification

class TextClassificationModel(nn.Module):

    def __init__(self,vocab_size, embedding_dim, max_len, num_class):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size+1, embedding_dim)
        self.dropout = nn.Dropout(0.2)
        self.fc_flat = nn.Linear(embedding_dim, embedding_dim)
        self.fc = nn.Linear(embedding_dim, num_class) # max_len*embedding_dim
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.embedding(x)
        x = self.fc_flat(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = x.mean(axis = 1)
        # x = torch.flatten(x, 1)
        x = self.fc(x)
        return(x)

Our model begins with an embedding layer, where each token is looked up in an embedding table and turned into a learned vector of size embedding_dim. Immediately after embedding, we pass each token’s vector through a small fully-connected layer and a ReLU activation, which lets the model learn a richer per-token representation before pooling. We then apply dropout, which randomly zeroes 20% of the activations during training. This is a form of regularization meant to keep the model from becoming over-reliant on particular features. Next, mean pooling averages the token vectors across the sequence (padding positions included), so each song becomes a single fixed-size vector. Finally, the output linear layer produces one raw score (logit) per genre; the cross-entropy loss we use during training applies the softmax that turns these scores into probabilities.
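
Because every song is padded to 512 tokens, a plain mean over the sequence also averages in the vectors at padding positions. One possible refinement, which the model above does not use, is a masked mean that ignores padding. This is a hedged sketch: `masked_mean` is a hypothetical helper, and it assumes padding is marked by token id 0.

```python
import torch

def masked_mean(embedded, token_ids, pad_id=0):
    """Average token vectors over non-padding positions only.

    embedded:  (batch, seq_len, dim) float tensor
    token_ids: (batch, seq_len) long tensor; pad_id marks padding
    """
    mask = (token_ids != pad_id).unsqueeze(-1).to(embedded.dtype)  # (batch, seq_len, 1)
    summed = (embedded * mask).sum(dim=1)                          # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1.0)                        # avoid division by zero
    return summed / counts

# one toy sequence: three real tokens followed by two padding positions
token_ids = torch.tensor([[101, 7, 102, 0, 0]])
embedded = torch.tensor([[[1.0], [2.0], [3.0], [9.0], [9.0]]])

print(masked_mean(embedded, token_ids).item())  # 2.0, the mean of 1, 2, 3
print(embedded.mean(dim=1).item())              # about 4.8, pulled up by the padding rows
```

Using this inside `forward` would require passing the token ids alongside the embedded tensor; whether it actually improves accuracy here is untested.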

Let’s have a look at it!

vocab_size = len(tokenizer.vocab)
embedding_dim = 32
num_class = len(genres)

text_model = TextClassificationModel(vocab_size, embedding_dim, max_len, num_class).to(device)
# no input size is passed, so torchinfo reports parameter counts only
summary(text_model)
=================================================================
Layer (type:depth-idx)                   Param #
=================================================================
TextClassificationModel                  --
├─Embedding: 1-1                         976,736
├─Dropout: 1-2                           --
├─Linear: 1-3                            1,056
├─Linear: 1-4                            132
├─ReLU: 1-5                              --
=================================================================
Total params: 977,924
Trainable params: 977,924
Non-trainable params: 0
=================================================================

We have a lot of trainable parameters! Nearly all of them sit in the embedding table (976,736 of the 977,924), so we could make this architecture more lightweight by shrinking the embedding dimension.
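
The numbers in the summary can be reproduced by hand, which makes it clear where the parameters live. The sketch below assumes the `bert-base-uncased` vocabulary has 30,522 entries, which is consistent with the embedding count reported above; the helper names are made up.

```python
def embedding_params(num_embeddings, dim):
    return num_embeddings * dim      # one learned vector per table entry

def linear_params(n_in, n_out):
    return n_in * n_out + n_out      # weights + biases

vocab_size = 30522   # assumed size of the bert-base-uncased vocabulary
embedding_dim = 32
num_class = 4

counts = {
    "embedding": embedding_params(vocab_size + 1, embedding_dim),
    "fc_flat": linear_params(embedding_dim, embedding_dim),
    "fc": linear_params(embedding_dim, num_class),
}
total = sum(counts.values())

print(counts)  # {'embedding': 976736, 'fc_flat': 1056, 'fc': 132}
print(total)   # 977924
print(round(counts["embedding"] / total, 4))  # 0.9988
```

Halving `embedding_dim` to 16 would roughly halve the model, since the embedding term scales linearly with it.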

Below, we define a training loop that can be used for all three of our models. We also define an accuracy function to evaluate overall accuracy and another to evaluate per-class accuracy.

def train(model, dataloader, mode="lyrics", vocab_freq=False):
    # put the model in training mode so dropout is active;
    # the evaluation helpers below switch it to eval mode
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    epoch_start_time = time.time()
    # keep track of some counts for measuring accuracy
    total_acc, total_count = 0, 0

    for X, y in dataloader:
        # unpack and move to device
        tokens, engineered = X
        y = y.to(device)

        if mode == "lyrics":
            data = tokens.to(device)
        elif mode == "engineered":
            data = engineered.to(device)
        else:
            # "both": pass the (tokens, engineered) tuple through unchanged
            data = X

        # zero gradients
        optimizer.zero_grad()
        # form prediction on batch
        predicted_label = model(data)
        # evaluate loss on prediction
        loss = loss_fn(predicted_label, y)
        # compute gradient
        loss.backward()
        # take an optimization step
        optimizer.step()

        # for printing accuracy
        total_acc += (predicted_label.argmax(1) == y).sum().item()
        total_count += y.size(0)

    # `epoch` is read from the notebook-level training loop, not passed in
    print(f'| epoch {epoch:3d} | train accuracy {total_acc/total_count:8.3f} | time: {time.time() - epoch_start_time:5.2f}s')

def accuracy(model, dataloader, mode="lyrics"):
    # evaluate with dropout disabled
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for X, y in dataloader:
            # unpack and move to device
            tokens, engineered = X
            y = y.to(device)

            if mode == "lyrics":
                data = tokens.to(device)
            elif mode == "engineered":
                data = engineered.to(device)
            elif mode == "both":
                data = X

            predicted_label = model(data)
            total_acc += (predicted_label.argmax(1) == y).sum().item()
            total_count += y.size(0)
    return total_acc/total_count

def per_class_accuracy(model, dataloader, mode="lyrics", num_classes=4):
    model.eval()
    correct = [0] * num_classes
    total   = [0] * num_classes

    with torch.no_grad():
        for X, y in dataloader:
            tokens, engineered = X
            y = y.to(device)

            if mode == "lyrics":
                data = tokens.to(device)
            elif mode == "engineered":
                data = engineered.to(device)
            else:
                data = X

            outputs = model(data)
            preds = outputs.argmax(dim=1)

            for cls in range(len(correct)):
                mask = (y == cls)
                total[cls] += mask.sum().item()
                correct[cls] += ((preds == cls) & mask).sum().item()

    return {
        cls: (correct[cls] / total[cls] if total[cls] > 0 else 0.0)
        for cls in range(len(correct))
    }
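
Note that the training loop hands the model’s raw outputs straight to `CrossEntropyLoss`, which applies a log-softmax internally, and that `argmax` on the raw scores picks the same class as `argmax` on the probabilities. A plain-Python sketch of that relationship, with made-up scores for one song:

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])

# made-up raw scores for one song over (hip hop, jazz, reggae, rock)
logits = [2.0, 0.5, -1.0, 0.0]
probs = softmax(logits)

pred_from_logits = max(range(len(logits)), key=logits.__getitem__)
pred_from_probs = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs])
print(pred_from_logits, pred_from_probs)
print(cross_entropy(logits, 0) < cross_entropy(logits, 2))  # True: lower loss when the top class is the target
```

This is why none of the models in this post end with a softmax layer: the loss function takes care of it during training, and accuracy only needs the position of the largest score.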

Now that we have those functions, let’s jump right in and see how our model does when training on lyrics!

EPOCHS = 25
for epoch in range(1, EPOCHS + 1):
    train(text_model, train_loader, "lyrics")
    print("     test accuracy  ", accuracy(text_model, val_loader))
| epoch   1 | train accuracy    0.379 | time:  4.47s
     test accuracy   0.3540097474523704
| epoch   2 | train accuracy    0.398 | time:  3.97s
     test accuracy   0.39787328311918474
| epoch   3 | train accuracy    0.425 | time:  3.94s
     test accuracy   0.4262295081967213
| epoch   4 | train accuracy    0.457 | time:  4.03s
     test accuracy   0.42977403633141337
| epoch   5 | train accuracy    0.498 | time:  4.06s
     test accuracy   0.46256092157731504
| epoch   6 | train accuracy    0.556 | time:  3.77s
     test accuracy   0.4980062029242357
| epoch   7 | train accuracy    0.600 | time:  3.75s
     test accuracy   0.538325210456358
| epoch   8 | train accuracy    0.642 | time:  3.64s
     test accuracy   0.5578201151971643
| epoch   9 | train accuracy    0.674 | time:  3.64s
     test accuracy   0.5604785112981834
| epoch  10 | train accuracy    0.695 | time:  3.67s
     test accuracy   0.5746566238369517
| epoch  11 | train accuracy    0.712 | time:  3.68s
     test accuracy   0.5813026140894993
| epoch  12 | train accuracy    0.729 | time:  3.74s
     test accuracy   0.5795303500221533
| epoch  13 | train accuracy    0.745 | time:  4.02s
     test accuracy   0.5724412937527692
| epoch  14 | train accuracy    0.757 | time:  3.94s
     test accuracy   0.5777580859548073
| epoch  15 | train accuracy    0.767 | time:  4.04s
     test accuracy   0.5764288879042977
| epoch  16 | train accuracy    0.782 | time:  3.85s
     test accuracy   0.5746566238369517
| epoch  17 | train accuracy    0.795 | time:  3.86s
     test accuracy   0.5755427558706248
| epoch  18 | train accuracy    0.799 | time:  3.91s
     test accuracy   0.5684536996012406
| epoch  19 | train accuracy    0.813 | time:  4.09s
     test accuracy   0.5755427558706248
| epoch  20 | train accuracy    0.821 | time:  3.99s
     test accuracy   0.5737704918032787
| epoch  21 | train accuracy    0.831 | time:  4.46s
     test accuracy   0.5693398316349136
| epoch  22 | train accuracy    0.840 | time:  4.20s
     test accuracy   0.5742135578201152
| epoch  23 | train accuracy    0.849 | time:  4.23s
     test accuracy   0.5622507753655295
| epoch  24 | train accuracy    0.857 | time:  4.39s
     test accuracy   0.561807709348693
| epoch  25 | train accuracy    0.859 | time:  4.26s
     test accuracy   0.562693841382366
accuracy(text_model, val_loader)
0.5666814355338945

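An accuracy around 56% may not seem all that great at first glance. However, let’s remember that our base rate was 36%, so this model beats the always-guess-the-largest-class baseline by about 20 percentage points. Note also that training accuracy keeps climbing to 86% while validation accuracy stays between 56% and 58% from epoch 10 onward, which suggests the model is starting to overfit the training lyrics.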
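
The training log above shows validation accuracy peaking near epoch 11 while training accuracy keeps rising. One common response, which this post does not implement, is early stopping. The sketch below replays the logged validation accuracies (rounded to four decimals) through a hypothetical `early_stopping_epoch` helper with an assumed patience of 3 epochs.

```python
def early_stopping_epoch(val_accs, patience=3):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to beat the
    best validation accuracy seen so far.
    """
    best_acc, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accs, start=1):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_accs)

# validation accuracies from the lyrics-model log, epochs 1 to 25
val_accs = [
    0.3540, 0.3979, 0.4262, 0.4298, 0.4626, 0.4980, 0.5383, 0.5578, 0.5605,
    0.5747, 0.5813, 0.5795, 0.5724, 0.5778, 0.5764, 0.5747, 0.5755, 0.5685,
    0.5755, 0.5738, 0.5693, 0.5742, 0.5623, 0.5618, 0.5627,
]

best_epoch, stop_epoch = early_stopping_epoch(val_accs, patience=3)
print(best_epoch, stop_epoch)  # 11 14
```

Under these assumptions training would have stopped after epoch 14 and kept the epoch-11 weights, saving about eleven epochs without losing validation accuracy.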

Let’s look at our accuracy on each of our genres. A quick reminder that our genre keys are:

  • hip hop: 0
  • jazz: 1
  • reggae: 2
  • rock: 3

per_class_accuracy(text_model, val_loader, mode="lyrics")
{0: 0.47701149425287354,
 1: 0.5816326530612245,
 2: 0.5031055900621118,
 3: 0.6090686274509803}

Even our weakest genre (hip hop at around 48%) comfortably exceeds the 36% base rate! Our model is indeed learning useful signals from the lyrics. Our best performances were on jazz and rock, which may suggest that those lyrics have more distinct stylistic patterns, although they are also our two largest classes. Hip hop and reggae, on the other hand, may have suffered because of slang or patois lyrics, or possibly thematic overlap.
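
Per-class accuracy tells us how often each genre is recognized, but not what it gets mistaken for. A confusion matrix would answer that. The sketch below is plain Python on made-up labels, purely to show the bookkeeping; it is not output from our model.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def per_class_recall(matrix):
    """Diagonal entry divided by row total: the per-class accuracy used above."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

# made-up labels for illustration (0: hip hop, 1: jazz, 2: reggae, 3: rock)
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 3, 1, 1, 2, 3, 3, 3]

cm = confusion_matrix(y_true, y_pred, num_classes=4)
for row in cm:
    print(row)
print(per_class_recall(cm))  # [0.5, 1.0, 0.5, 1.0]
```

In this toy example the off-diagonal entries show hip hop and reggae both leaking into rock, which is the kind of pattern we would look for to test the thematic-overlap explanation.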

Engineered Features Classification

Let’s tackle using our engineered features to try and determine song genres!

class MetadataClassificationModel(nn.Module):

    def __init__(self, num_features, num_class):
        super().__init__()
    
        self.pipeline = nn.Sequential(
            nn.Linear(num_features, 18), 
            nn.ReLU(),
            nn.Linear(18, 12), 
            nn.ReLU(),
            nn.Linear(12, 8), 
            nn.ReLU(),
            nn.Linear(8, num_class)
            )

    def forward(self, x):
        return self.pipeline(x)

    def predict(self, x): 
        # predicted class index = position of the largest output score
        return self.forward(x).argmax(dim=1)

This is a pretty simple architecture for our twenty-two engineered features. We use a stack of fully-connected linear layers, with a ReLU activation after every layer except the last, which outputs one score per genre.
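
For a stack of linear layers like this one, the parameter count follows directly from the layer widths. A small sketch with a hypothetical helper `mlp_param_count`, using the widths from the class definition above:

```python
def mlp_param_count(sizes):
    """Per-layer parameter counts for fully-connected layers of the given widths."""
    # each layer has n_in * n_out weights plus n_out biases
    return [n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:])]

# 22 engineered features -> 18 -> 12 -> 8 -> 4 genres
sizes = [22, 18, 12, 8, 4]
per_layer = mlp_param_count(sizes)

print(per_layer)       # [414, 228, 104, 36]
print(sum(per_layer))  # 782
```

ReLU layers contribute no parameters, so the whole network has fewer than a thousand weights to learn.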

num_features = len(engineered_features)

meta_model = MetadataClassificationModel(num_features, num_class).to(device)
# no input size is passed, so torchinfo reports parameter counts only
summary(meta_model)
=================================================================
Layer (type:depth-idx)                   Param #
=================================================================
MetadataClassificationModel              --
├─Sequential: 1-1                        --
│    └─Linear: 2-1                       414
│    └─ReLU: 2-2                         --
│    └─Linear: 2-3                       228
│    └─ReLU: 2-4                         --
│    └─Linear: 2-5                       104
│    └─ReLU: 2-6                         --
│    └─Linear: 2-7                       36
=================================================================
Total params: 782
Trainable params: 782
Non-trainable params: 0
=================================================================

This model is pretty lightweight compared to the lyric-based model. Let’s see how it performs!

EPOCHS = 25
for epoch in range(1, EPOCHS + 1):
    train(meta_model, train_loader, "engineered")
    print("     test accuracy  ", accuracy(meta_model, val_loader, "engineered"))
| epoch   1 | train accuracy    0.459 | time:  4.19s
     test accuracy   0.5002215330084182
| epoch   2 | train accuracy    0.589 | time:  4.00s
     test accuracy   0.615861763402747
| epoch   3 | train accuracy    0.636 | time:  4.65s
     test accuracy   0.6278245458573327
| epoch   4 | train accuracy    0.643 | time:  3.95s
     test accuracy   0.6371289322108994
| epoch   5 | train accuracy    0.645 | time:  3.90s
     test accuracy   0.6371289322108994
| epoch   6 | train accuracy    0.650 | time:  3.75s
     test accuracy   0.640230394328755
| epoch   7 | train accuracy    0.649 | time:  3.73s
     test accuracy   0.6357997341603899
| epoch   8 | train accuracy    0.652 | time:  3.78s
     test accuracy   0.642002658396101
| epoch   9 | train accuracy    0.649 | time:  3.51s
     test accuracy   0.6468763845813026
| epoch  10 | train accuracy    0.651 | time:  3.73s
     test accuracy   0.640230394328755
| epoch  11 | train accuracy    0.649 | time:  3.52s
     test accuracy   0.6513070447496677
| epoch  12 | train accuracy    0.654 | time:  3.54s
     test accuracy   0.641116526362428
| epoch  13 | train accuracy    0.654 | time:  3.74s
     test accuracy   0.6504209127159947
| epoch  14 | train accuracy    0.655 | time:  3.39s
     test accuracy   0.6526362428001772
| epoch  15 | train accuracy    0.656 | time:  3.48s
     test accuracy   0.6424457244129376
| epoch  16 | train accuracy    0.654 | time:  3.31s
     test accuracy   0.6464333185644661
| epoch  17 | train accuracy    0.657 | time:  3.27s
     test accuracy   0.6451041205139566
| epoch  18 | train accuracy    0.654 | time:  3.38s
     test accuracy   0.6442179884802836
| epoch  19 | train accuracy    0.658 | time:  3.29s
     test accuracy   0.642002658396101
| epoch  20 | train accuracy    0.660 | time:  3.32s
     test accuracy   0.6446610544971201
| epoch  21 | train accuracy    0.658 | time:  3.24s
     test accuracy   0.6477625166149756
| epoch  22 | train accuracy    0.657 | time:  3.14s
     test accuracy   0.6482055826318122
| epoch  23 | train accuracy    0.658 | time:  3.21s
     test accuracy   0.6477625166149756
| epoch  24 | train accuracy    0.656 | time:  3.17s
     test accuracy   0.6455471865307931
| epoch  25 | train accuracy    0.655 | time:  3.09s
     test accuracy   0.6530793088170137
accuracy(meta_model, val_loader, "engineered")
0.6530793088170137

Woah! Using only metadata, we achieved around 65% accuracy! This is much better than our base rate, and higher than the lyrics-only classification approach.

per_class_accuracy(meta_model, val_loader, mode="engineered")
{0: 0.5689655172413793,
 1: 0.6224489795918368,
 2: 0.6128364389233955,
 3: 0.7242647058823529}

We are also outperforming the base rate for every genre! Once again, rock is our highest performer (around 72%), showing its distinction from other genres in categories like instrumentalness, energy, movement/places, etc.

Combined Feature Classification

We have now explored successful approaches using lyrics and using metadata. Let’s see how we perform when we combine the two!

class CombinedNet(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, num_class, num_features):
        super().__init__()
    
        # engineered features pipeline
        self.eng_pipeline = nn.Sequential(
            nn.Linear(num_features, 18), 
            nn.ReLU(),
            nn.Linear(18, 12), 
            nn.ReLU(),
            nn.Linear(12, 8)
            )
        
        # text pipeline 
        self.embedding = nn.Embedding(vocab_size+1, embedding_dim)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(embedding_dim, 8)

        # combine the two pipelines
        self.combine = nn.Sequential(
            nn.Linear(16, 12), 
            nn.ReLU(),
            nn.Linear(12, 8), 
            nn.ReLU(),
            nn.Linear(8, num_class)
        )
    
    def forward(self, x):
        x_text, x_eng = x
        x_text = x_text.to(device)  
        x_eng = x_eng.to(device)
        
        # text pipeline:
        x_text = self.embedding(x_text)
        x_text = self.relu(x_text)
        x_text = x_text.mean(axis = 1)
        x_text = self.fc(x_text)

        # engineered features pipeline:
        x_eng = self.eng_pipeline(x_eng)

        # then, combine them with: 
        x_comb = torch.cat([x_text, x_eng], dim = 1).to(device)
        
        # pass x_comb through a couple more fully-connected layers and return output
        return self.combine(x_comb)

The main ideas from the other pipelines remain. Each input first passes through its own pipeline, following procedures similar to those above; then we concatenate the two 8-dimensional outputs and pass the resulting 16-dimensional vector through several more fully-connected layers. Notable changes come in our text pipeline, where we removed a fully-connected layer and our dropout. These changes were a result of trial-and-error testing. Additionally, we bring our separate pipelines together before they are compressed back into our four-class classification.
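To make the shapes concrete, here is a minimal sketch that traces one batch through the same sequence of layers outside of the class. The batch size, sequence length, vocabulary size, embedding dimension, and feature count below are made-up illustration values, not the ones used in this post.

```python
import torch
from torch import nn

torch.manual_seed(0)

# hypothetical sizes, chosen only to illustrate the shapes
batch_size, max_len = 4, 10
vocab_size, embedding_dim, num_features = 100, 32, 22

# same layer layout as CombinedNet's two pipelines
embedding = nn.Embedding(vocab_size + 1, embedding_dim)
fc = nn.Linear(embedding_dim, 8)
eng_pipeline = nn.Sequential(
    nn.Linear(num_features, 18), nn.ReLU(),
    nn.Linear(18, 12), nn.ReLU(),
    nn.Linear(12, 8),
)

# fake inputs: token ids for the lyrics, floats for the engineered features
x_text = torch.randint(0, vocab_size + 1, (batch_size, max_len))
x_eng = torch.rand(batch_size, num_features)

embedded = torch.relu(embedding(x_text))  # (batch, max_len, embedding_dim)
pooled = embedded.mean(dim=1)             # mean-pool over tokens -> (batch, embedding_dim)
text_out = fc(pooled)                     # (batch, 8)
eng_out = eng_pipeline(x_eng)             # (batch, 8)

# concatenate along the feature dimension -> (batch, 16)
combined = torch.cat([text_out, eng_out], dim=1)
print(tuple(pooled.shape), tuple(text_out.shape), tuple(eng_out.shape), tuple(combined.shape))
```

The 16 in `nn.Linear(16, 12)` is simply the sum of the two 8-dimensional pipeline outputs, so changing either output width means changing that first combining layer too.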

combined_model = CombinedNet(vocab_size, embedding_dim, num_class, num_features).to(device)
# no example input is given, so torchinfo reports parameter counts only
summary(combined_model)
=================================================================
Layer (type:depth-idx)                   Param #
=================================================================
CombinedNet                              --
├─Sequential: 1-1                        --
│    └─Linear: 2-1                       414
│    └─ReLU: 2-2                         --
│    └─Linear: 2-3                       228
│    └─ReLU: 2-4                         --
│    └─Linear: 2-5                       104
├─Embedding: 1-2                         976,736
├─ReLU: 1-3                              --
├─Linear: 1-4                            264
├─Sequential: 1-5                        --
│    └─Linear: 2-6                       204
│    └─ReLU: 2-7                         --
│    └─Linear: 2-8                       104
│    └─ReLU: 2-9                         --
│    └─Linear: 2-10                      36
=================================================================
Total params: 978,090
Trainable params: 978,090
Non-trainable params: 0
=================================================================
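As a sanity check, the counts in this table can be reproduced by hand. The sketch below assumes `num_features = 22`, `embedding_dim = 32`, and `vocab_size = 30522`; these values are inferred by working backwards from the table (for example, 18 × (22 + 1) = 414), not read from the notebook's variables.

```python
def linear_params(in_features, out_features):
    # one weight per input/output pair, plus one bias per output unit
    return in_features * out_features + out_features

def embedding_params(num_embeddings, embedding_dim):
    # one embedding vector per token id, no bias
    return num_embeddings * embedding_dim

# assumed values, inferred from the summary table above
num_features = 22
embedding_dim = 32
vocab_size = 30522
num_class = 4

eng = linear_params(num_features, 18) + linear_params(18, 12) + linear_params(12, 8)
emb = embedding_params(vocab_size + 1, embedding_dim)
text = emb + linear_params(embedding_dim, 8)
head = linear_params(16, 12) + linear_params(12, 8) + linear_params(8, num_class)

total = eng + text + head
embedding_share = emb / total

print(eng, text, head, total)
print(f"embedding share of parameters: {embedding_share:.4f}")
```

Under these assumptions the embedding layer alone accounts for well over 99% of the model's parameters; everything else adds up to fewer than 1,500.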

Evidently, our model once again has a huge number of trainable parameters, nearly all of them (976,736 of 978,090) in the embedding layer. Let’s see how it does!

EPOCHS = 25
for epoch in range(1, EPOCHS + 1):
    train(combined_model, train_loader, "both")
    print("     test accuracy  ", accuracy(combined_model, val_loader, "both"))
| epoch   1 | train accuracy    0.495 | time:  5.70s
     test accuracy   0.5423128046078866
| epoch   2 | train accuracy    0.564 | time:  5.37s
     test accuracy   0.5560478511298184
| epoch   3 | train accuracy    0.571 | time:  5.38s
     test accuracy   0.5724412937527692
| epoch   4 | train accuracy    0.584 | time:  5.38s
     test accuracy   0.5799734160389898
| epoch   5 | train accuracy    0.594 | time:  5.33s
     test accuracy   0.5516171909614532
| epoch   6 | train accuracy    0.611 | time:  5.45s
     test accuracy   0.5990252547629596
| epoch   7 | train accuracy    0.631 | time:  5.34s
     test accuracy   0.6322552060256978
| epoch   8 | train accuracy    0.644 | time:  5.38s
     test accuracy   0.615861763402747
| epoch   9 | train accuracy    0.654 | time:  5.48s
     test accuracy   0.6473194505981391
| epoch  10 | train accuracy    0.669 | time:  5.75s
     test accuracy   0.6464333185644661
| epoch  11 | train accuracy    0.677 | time:  6.30s
     test accuracy   0.6566238369517058
| epoch  12 | train accuracy    0.692 | time:  5.92s
     test accuracy   0.6544085068675233
| epoch  13 | train accuracy    0.699 | time:  6.66s
     test accuracy   0.6575099689853788
| epoch  14 | train accuracy    0.711 | time:  6.78s
     test accuracy   0.6575099689853788
| epoch  15 | train accuracy    0.720 | time:  7.07s
     test accuracy   0.6326982720425344
| epoch  16 | train accuracy    0.729 | time:  7.26s
     test accuracy   0.66371289322109
| epoch  17 | train accuracy    0.741 | time:  7.66s
     test accuracy   0.6548515728843598
| epoch  18 | train accuracy    0.748 | time: 10.54s
     test accuracy   0.6526362428001772
| epoch  19 | train accuracy    0.759 | time:  9.17s
     test accuracy   0.6570669029685423
| epoch  20 | train accuracy    0.765 | time:  8.58s
     test accuracy   0.6641559592379265
| epoch  21 | train accuracy    0.776 | time:  6.94s
     test accuracy   0.6495347806823216
| epoch  22 | train accuracy    0.784 | time:  6.50s
     test accuracy   0.6486486486486487
| epoch  23 | train accuracy    0.789 | time:  6.38s
     test accuracy   0.66371289322109
| epoch  24 | train accuracy    0.800 | time:  6.19s
     test accuracy   0.6544085068675233
| epoch  25 | train accuracy    0.811 | time:  6.24s
     test accuracy   0.6260522817899867
accuracy(combined_model, val_loader, "both")
0.6260522817899867

After twenty-five epochs, we achieved an accuracy of around 62%, which is slightly disappointing. If we look closely at the evolution of our testing accuracy, we hovered around 65% from roughly epoch 9 onward, peaking at about 66% in epoch 20. The final drop may just be noise in the training process, or it may reflect the beginning of overfitting: training accuracy kept climbing to 81% while testing accuracy stayed flat.
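One common way to guard against this kind of late drop is early stopping: keep track of the best validation accuracy and stop once it has not improved for a set number of epochs. The sketch below is not part of the training loop used in this post; the `early_stop` helper and the `patience` value are hypothetical, and the accuracies are the test accuracies logged above, rounded to four decimals.

```python
def early_stop(val_accuracies, patience):
    """Return (best_epoch, best_acc, stop_epoch) for a patience-based rule."""
    best_acc, best_epoch, waited = float("-inf"), 0, 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            # strict improvement: remember it and reset the counter
            best_acc, best_epoch, waited = acc, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, best_acc, epoch
    # patience never ran out, so training ran to the end
    return best_epoch, best_acc, len(val_accuracies)

# test accuracies of the combined model, epochs 1 through 25 (rounded)
val_acc = [
    0.5423, 0.5560, 0.5724, 0.5800, 0.5516,
    0.5990, 0.6323, 0.6159, 0.6473, 0.6464,
    0.6566, 0.6544, 0.6575, 0.6575, 0.6327,
    0.6637, 0.6549, 0.6526, 0.6571, 0.6642,
    0.6495, 0.6486, 0.6637, 0.6544, 0.6261,
]

best_epoch, best_acc, stop_epoch = early_stop(val_acc, patience=3)
print(best_epoch, best_acc, stop_epoch)

# the overall best epoch, ignoring patience
overall_best_epoch = max(range(len(val_acc)), key=lambda i: val_acc[i]) + 1
print(overall_best_epoch, val_acc[overall_best_epoch - 1])
```

With a patience of 3, this rule would have stopped at epoch 19 and kept the epoch 16 model (about 66%), rather than the epoch 25 model (about 63%). In a real training loop one would also save the model's weights whenever the best accuracy improves.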

per_class_accuracy(combined_model, val_loader, mode="both")
{0: 0.6839080459770115,
 1: 0.4872448979591837,
 2: 0.6977225672877847,
 3: 0.7046568627450981}

Curse you, jazz! We do significantly better on every other genre (roughly 68% to 70%) than on jazz (about 49%). This may be because jazz lyrics are slightly less theme-driven, combined with the atypical structure of jazz music. Maybe swing rhythms, tempo changes, and odd time signatures, along with the lyrics, don’t fit neatly into any given category.
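To see where combining the features helped and where it hurt, we can subtract the metadata-only per-class accuracies from the combined ones. The numbers below are the two `per_class_accuracy` outputs above, rounded to four decimals. The label-to-genre mapping is an assumption (alphabetical order), consistent with rock being class 3 and jazz being class 1 in the discussion above.

```python
# assumed label order; only rock (3) and jazz (1) are implied by the text above
genres = {0: "hip hop", 1: "jazz", 2: "reggae", 3: "rock"}

meta_acc = {0: 0.5690, 1: 0.6224, 2: 0.6128, 3: 0.7243}
combined_acc = {0: 0.6839, 1: 0.4872, 2: 0.6977, 3: 0.7047}

# positive means the combined model did better on that genre
delta = {genres[k]: combined_acc[k] - meta_acc[k] for k in meta_acc}

# print from the biggest loss to the biggest gain
for genre, change in sorted(delta.items(), key=lambda item: item[1]):
    print(f"{genre:8s} {change:+.3f}")
```

Under that mapping, the combined model gains on hip hop and reggae, is roughly level on rock, and loses more than thirteen points on jazz, which is what drags the overall accuracy below the metadata-only model.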

Closing Remarks

Through our explorations, our metadata-only model yielded the highest accuracy at around 65%, our combined network was not far behind at around 62%, and trailing behind was a purely lyric-based approach that achieved 56% accuracy. Despite their varying and somewhat low accuracies, all the models outperformed the base rate of 36%. We narrowed down our search space to only four genres: hip hop, jazz, reggae, and rock. Of these genres, we had the easiest time distinguishing rock and reggae, with jazz proving especially hard to nail down.

This blog post was obviously an exercise in crafting deep learning pipelines by applying themes we learned in readings and class (e.g., mean-pooling, non-linear activation functions, hidden layers) and simple trial and error. One large takeaway I had was that feature concatenation does not guarantee improved model accuracy, and in some cases can provide more noise than clarity to the model.

Some possible continuations for this project could be to modify model depth and complexity, implement vocabulary thresholds, expand the number of genres we look at, etc.
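As a rough idea of what a vocabulary threshold could look like, the sketch below keeps only tokens that appear at least `min_freq` times and maps everything else to a shared unknown id. It is a hypothetical illustration: it splits on whitespace rather than using the BERT tokenizer from this post, and the `build_vocab` and `encode` helpers, the toy lyrics, and the threshold are all made up.

```python
from collections import Counter

def build_vocab(texts, min_freq):
    # count how often each whitespace-separated token appears
    counts = Counter(token for text in texts for token in text.lower().split())
    kept = sorted(token for token, count in counts.items() if count >= min_freq)
    # id 0 is reserved for every token below the threshold
    vocab = {"<unk>": 0}
    for token in kept:
        vocab[token] = len(vocab)
    return vocab

def encode(text, vocab):
    # rare or unseen tokens fall back to the unknown id
    return [vocab.get(token, vocab["<unk>"]) for token in text.lower().split()]

# toy lyrics, invented for this example
lyrics = [
    "love me love you",
    "rock and roll all night",
    "love the night",
]

vocab = build_vocab(lyrics, min_freq=2)
ids = encode("love the night", vocab)
print(vocab)
print(ids)
```

Since the embedding layer holds almost all of the parameters in the lyric and combined models, shrinking the vocabulary this way would shrink the models roughly in proportion.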